Fitting an XGBoost classifier
- Extreme Gradient Boosting (XGBoost): XGBoost is an implementation of Gradient Boosted Trees that incorporates a series of improvements resulting in superior performance (both in terms of evaluation metrics and training time). Since its publication, the algorithm has been used to win many data science competitions. In this recipe, we only present a high-level overview of its distinguishing features. For a more detailed overview, please refer to the original paper or documentation. The key concepts of XGBoost are:
- XGBoost combines a pre-sorted algorithm with a histogram-based algorithm to calculate the best splits. This tackles a significant inefficiency of Gradient Boosted Trees, that is, for creating a new branch, they consider the potential loss for all possible splits (especially important when considering hundreds, or thousands, of features).
- The algorithm uses the Newton-Raphson method for boosting (instead of plain gradient descent): it uses second-order (Hessian) information about the loss function, which provides a more direct route to its minimum.
- XGBoost has an extra randomization parameter to reduce the correlation between the trees.
- XGBoost combines Lasso (L1) and Ridge (L2) regularization to prevent overfitting.
- It offers a different (more efficient) approach to tree pruning.
- XGBoost supports monotonic constraints, which force the prediction to be non-decreasing or non-increasing in a chosen feature: the algorithm sacrifices some accuracy and increases the training time to improve model interpretability.
- XGBoost has traditionally not taken categorical features as input, so we must use some kind of encoding (recent versions add experimental native support for categorical data).
- The algorithm can handle missing values in the data.
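To make the Newton-Raphson and regularization points above more concrete, the following minimal sketch computes the weight of a single tree leaf from the first and second derivatives of the log loss, using the closed form w = -G / (H + lambda) described in the XGBoost paper. The function names and the toy data are illustrative assumptions and not part of the XGBoost API; the L1 penalty is left out for brevity.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def logloss_grad_hess(y_true, raw_score):
    """First and second derivative of the log loss w.r.t. the raw score."""
    p = sigmoid(raw_score)
    return p - y_true, p * (1.0 - p)


def newton_leaf_weight(labels, raw_scores, reg_lambda=1.0):
    """Second-order (Newton) leaf weight: w = -G / (H + lambda)."""
    grads, hessians = zip(
        *(logloss_grad_hess(y, s) for y, s in zip(labels, raw_scores))
    )
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Hypothetical leaf holding four observations, all with a raw score of 0
# (that is, a predicted probability of 0.5).
labels = [1, 1, 1, 0]
raw_scores = [0.0, 0.0, 0.0, 0.0]

leaf_weight = newton_leaf_weight(labels, raw_scores, reg_lambda=1.0)
unregularized_weight = newton_leaf_weight(labels, raw_scores, reg_lambda=0.0)
print("Leaf weight with L2 penalty:", leaf_weight)
print("Leaf weight without penalty:", unregularized_weight)
```

In this toy case the L2 penalty shrinks the leaf weight towards zero, which is how the regularization term discourages overly confident leaves.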
 
How to do it...
Execute the following steps to fit an XGBoost classifier:
- Import the libraries:
import matplotlib.pyplot as plt  # needed for the Precision-Recall plot later on
from xgboost import XGBClassifier
from sklearn import metrics
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from imblearn.metrics import specificity_score
- Create an instance of the model, fit it to the training data, and create the prediction:
xgb_model = XGBClassifier(random_state=42)
xgb_model.fit(X_train_ros, y_train_ros)
y_pred = xgb_model.predict(X_test)
- Evaluate the results:
accuracy = accuracy_score(y_test, y_pred)
# Computed from hard class labels, so for a binary target this equals the macro-averaged recall
roc_score = roc_auc_score(y_test, y_pred)
# average='macro': compute the metric per class, then take the unweighted mean
precision = precision_score(y_test, y_pred, average='macro')
recall = recall_score(y_test, y_pred, average='macro')
f1 = f1_score(y_test, y_pred, average='macro')
specificity = specificity_score(y_test, y_pred)
print("Accuracy:", accuracy)
print("ROC Score:", roc_score)
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Specificity Score", specificity)
Accuracy: 0.9518914879517223
ROC Score: 0.9466626727047515
Precision: 0.6414053088347277
Recall: 0.9466626727047515
F1-Score: 0.7056395089417349
Specificity Score 0.9521069627946116
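The model above relies on the default settings. As a rough illustration of how the key concepts listed at the start of this recipe map to constructor arguments of XGBClassifier, the sketch below collects a few of them in a dictionary. The parameter names come from the XGBoost scikit-learn interface; the values are hypothetical placeholders and have not been tuned for this dataset.

```python
# Hypothetical, untuned settings; only the parameter names are meaningful here.
xgb_params = {
    "tree_method": "hist",    # histogram-based split finding
    "subsample": 0.8,         # fraction of rows sampled per tree (randomization)
    "colsample_bytree": 0.8,  # fraction of columns sampled per tree (randomization)
    "reg_alpha": 0.1,         # L1 (Lasso) penalty on the leaf weights
    "reg_lambda": 1.0,        # L2 (Ridge) penalty on the leaf weights
    "gamma": 0.0,             # minimum loss reduction required to keep a split (pruning)
    "random_state": 42,       # same seed as in the recipe
}

for name, value in xgb_params.items():
    print(f"{name} = {value}")
```

Assuming xgboost is installed, the dictionary could be passed to the model as XGBClassifier(**xgb_params) and fitted in the same way as in Step 2.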
How it works...
- In Step 1, we imported the required libraries.
- In Step 2, we used the typical scikit-learn approach to train a machine learning model. First, we created the object of the XGBClassifier class (using all the default settings). Then, we fitted the model to the training data (we needed to pass both the features and the target), using the fit method. Lastly, we obtained the predictions by using the predict method.
- In Step 3, we evaluated the performance of the model on the test set using functions from the metrics module of scikit-learn, together with specificity_score from imbalanced-learn. Precision, recall, and the F1 score are macro-averaged (average='macro'), that is, computed per class and then averaged with equal weight. Because roc_auc_score receives the hard class labels rather than predicted probabilities, the reported ROC score coincides with the macro-averaged recall.
- The confusion matrix summarizes all possible combinations of the predicted values versus the actual target. With the actual classes in rows and the predicted classes in columns (the scikit-learn convention), it has the following structure:
TN | FP
FN | TP
- The values are as follows:
- True positive (TP): The model predicts a default, and the client defaulted.
- False positive (FP): The model predicts a default, but the client did not default.
- True negative (TN): The model predicts a good customer, and the client did not default.
- False negative (FN): The model predicts a good customer, but the client defaulted.
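As a small illustration of the four cells described above, the following sketch counts them by hand for a toy set of labels. The helper name and the data are hypothetical; in practice, scikit-learn's confusion_matrix does the same job.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TN, FP, FN, TP for a binary classification problem."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted default, client defaulted
            else:
                fp += 1  # predicted default, client did not default
        else:
            if actual == positive:
                fn += 1  # predicted good customer, client defaulted
            else:
                tn += 1  # predicted good customer, client did not default
    return tn, fp, fn, tp


# Hypothetical labels: 1 = client defaulted, 0 = client did not default.
y_true_toy = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 0, 1, 0, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true_toy, y_pred_toy)
print(f"{tn} | {fp}")
print(f"{fn} | {tp}")
```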
 
- Using these values, we can further build multiple evaluation criteria:
- Accuracy ((TP + TN) / (TP + FP + TN + FN)): Measures the model's overall ability to correctly predict the class of the observation.
- Precision (TP / (TP + FP)): Out of all predictions of the positive class (in our case, the default), how many observations actually defaulted.
- Recall (TP / (TP + FN)): Out of all positive cases, how many were predicted correctly. Also called sensitivity or the true positive rate.
- F1 score (2 * Precision * Recall / (Precision + Recall)): The harmonic mean of precision and recall. The reason for using a harmonic mean instead of an arithmetic mean is that it punishes extreme outcomes, such as precision = 1 and recall = 0, or vice versa.
- Specificity (TN / (TN + FP)): Out of all negative cases (clients without a default), how many were correctly predicted as negative. Also called the true negative rate.
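The formulas above can be checked with a few lines of plain Python. This is a minimal sketch with hypothetical confusion-matrix counts; it assumes every denominator is non-zero and, unlike Step 3 of the recipe, it reports the metrics for the positive class only rather than macro averages.

```python
def harmonic_mean(a, b):
    """Harmonic mean of two non-negative numbers (0 if both are 0)."""
    return 0.0 if a + b == 0 else 2 * a * b / (a + b)


def binary_metrics(tn, fp, fn, tp):
    """Evaluation metrics for the positive class, built from the four counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": harmonic_mean(precision, recall),
        "specificity": tn / (tn + fp),
    }


# Hypothetical counts for an imbalanced problem with 1,000 observations.
scores = binary_metrics(tn=930, fp=10, fn=20, tp=40)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")

# The harmonic mean punishes extreme outcomes; the arithmetic mean does not.
extreme_f1 = harmonic_mean(1.0, 0.0)
extreme_arithmetic = (1.0 + 0.0) / 2
print("F1 for precision = 1, recall = 0:", extreme_f1)
print("Arithmetic mean for the same case:", extreme_arithmetic)
```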
 
- Understanding the subtleties behind these metrics is very important for the correct evaluation of the model's performance. Accuracy can be highly misleading in the case of class imbalance. Imagine a case when 99% of data is not fraudulent and only 1% is fraudulent. Then, a naïve model classifying each observation as non-fraudulent achieves 99% accuracy, while it is actually worthless. That is why, in such cases, we should refer to precision or recall. When we try to achieve as high precision as possible, we will get fewer false positives, at the cost of more false negatives. When optimizing for recall, we will achieve fewer false negatives, at the cost of more false positives. The metric on which we try to optimize should be selected based on the use case.
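Both points from the paragraph above can be reproduced with toy data. The sketch below first evaluates a naive model that labels everything as non-fraudulent, and then shows how moving the decision threshold trades precision against recall. All labels, scores, and thresholds are made up for illustration, and the helper assumes that at least one observation is predicted as positive.

```python
# Part 1: a naive model on a 99% / 1% class split.
y_true_fraud = [1] * 10 + [0] * 990  # 1% fraudulent observations
y_pred_naive = [0] * 1000            # always predict "not fraudulent"

accuracy_naive = sum(a == p for a, p in zip(y_true_fraud, y_pred_naive)) / len(y_true_fraud)
caught = sum(1 for a, p in zip(y_true_fraud, y_pred_naive) if a == 1 and p == 1)
recall_naive = caught / sum(y_true_fraud)
print("Naive accuracy:", accuracy_naive)
print("Naive recall:", recall_naive)


# Part 2: the precision/recall trade-off when the threshold moves.
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold count as positive."""
    predicted = [int(s >= threshold) for s in scores]
    tp = sum(1 for a, p in zip(y_true, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(y_true, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(y_true, predicted) if a == 1 and p == 0)
    return tp / (tp + fp), tp / (tp + fn)


y_true_toy = [0, 0, 0, 1, 0, 1, 1, 1]
scores_toy = [0.1, 0.2, 0.4, 0.45, 0.6, 0.7, 0.8, 0.9]

low_threshold = precision_recall_at(y_true_toy, scores_toy, threshold=0.3)
high_threshold = precision_recall_at(y_true_toy, scores_toy, threshold=0.65)
print("Threshold 0.30 (precision, recall):", low_threshold)
print("Threshold 0.65 (precision, recall):", high_threshold)
```

With the higher threshold, precision rises while recall falls, which is the trade-off to weigh against the use case.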
There's more...
- The ROC curve becomes less reliable for evaluating the performance of the model when we are dealing with class imbalance. That is why, in such cases, we should use another curve: the Precision-Recall curve. Neither precision nor recall uses the true negatives, so the curve only reflects how well the minority class (the positive one) is predicted.
# Calculate precision and recall for different thresholds:
y_pred_prob = xgb_model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = metrics.precision_recall_curve(y_test, y_pred_prob)
# Having calculated the required elements, we can plot the curve:
ax = plt.subplot()
ax.plot(recall, precision, label=f'PR-AUC = {metrics.auc(recall, precision):.2f}')
ax.set(title='Precision-Recall Curve', xlabel='Recall', ylabel='Precision')
ax.legend()
- As a summary metric, we can approximate the area under the Precision-Recall curve by calling metrics.auc(recall, precision). Like the ROC-AUC, the PR-AUC ranges from 0 to 1, where 1 indicates the perfect model. Unlike the ROC-AUC, whose baseline for a random classifier is 0.5, the PR-AUC baseline is approximately the fraction of positive observations in the data. A model with a PR-AUC of 1 can identify all the positive observations (perfect recall), while not wrongly labeling a single negative observation as a positive one (perfect precision). We can consider models that bow towards the (1, 1) point as skillful.
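To show what sits behind the curve and its area, here is a minimal pure-Python sketch that sweeps the decision threshold over the predicted scores, records a (recall, precision) point for each threshold, and integrates the points with the trapezoidal rule. The helper names and the toy data are assumptions made for illustration; the sketch mirrors the idea behind precision_recall_curve and metrics.auc rather than reproducing their exact implementation.

```python
def precision_recall_points(y_true, scores):
    """(recall, precision) pairs, one per distinct threshold, highest first."""
    total_positives = sum(y_true)
    points = [(0.0, 1.0)]  # convention: precision of 1 at recall of 0
    for threshold in sorted(set(scores), reverse=True):
        selected = [y for y, s in zip(y_true, scores) if s >= threshold]
        true_positives = sum(selected)
        points.append((true_positives / total_positives,
                       true_positives / len(selected)))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels and predicted probabilities of the positive class.
y_true_toy = [0, 0, 1, 0, 1, 1]
scores_toy = [0.1, 0.3, 0.35, 0.6, 0.8, 0.9]
pr_auc_toy = trapezoid_area(precision_recall_points(y_true_toy, scores_toy))
print(f"PR-AUC (toy model): {pr_auc_toy:.4f}")

# A model that separates the classes perfectly reaches an area of 1.
pr_auc_perfect = trapezoid_area(
    precision_recall_points([0, 0, 1, 1], [0.1, 0.2, 0.8, 0.9])
)
print(f"PR-AUC (perfect model): {pr_auc_perfect:.4f}")
```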